Probabilistic and possibilistic language models based on the world wide web
نویسندگان
چکیده
Usually, language models are built either from a closed corpus, or by using World Wide Web retrieved documents, which are considered as a closed corpus themselves. In this paper we propose several other ways, more adapted to the nature of the Web, of using this resource for language modeling. We first start by improving an approach consisting in estimating n-gram probabilities from Web search engine statistics. Then, we propose a new way of considering the information extracted from the Web in a probabilistic framework. Then, we also propose to rely on Possibility Theory for effectively using this kind of information. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, concerning surgical operation film comments. We show that the two approaches are effective in different situations.
منابع مشابه
Web-based possibilistic language models for automatic speech recognition
This paper describes a new kind of language models based on the Possibility Theory. The purpose of these new models is to better use the data available on the Web for language modeling. These models aim to integrate information relative to impossible word sequences. We address the two main problems of using this kind of model: how to estimate the measures for word sequences and how to integrate...
متن کاملModèles de langage ad hoc pour la reconnaissance automatique de la parole. (Ad-hoc language models for automatic speech recognition)
The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model provides an indication of how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech rec...
متن کاملA Retrieval Method Based on Language Model Considering Neighboring Contents
The World Wide Web (WWW) has a massive number of Web pages, so that it is difficult for users to get useful information. In recent years, however, it is said that the probabilistic language model can help to improve retrieval accuracies of some kinds of search engines. The probabilistic language model has statistical background and can adapt previous text information retrieval model. However, w...
متن کاملCLIR using a Probabilistic Translation Model based on Web Documents
In this report, we describe the approach we used in TREC-8 Cross-Language IR (CLIR) track. The approach is based on probabilistic translation models estimated from two parallel training corpora: one established manually, and the other built automatically with the documents mined from the Web. We describe the principle of model building, the mining of parallel texts, as well as some preliminary ...
متن کاملCombination of probabilistic and possibilistic language models
In a previous paper we proposed Web-based language models relying on the possibility theory. These models explicitly represent the possibility of word sequences. In this paper we propose to find the best way of combining this kind of model with classical probabilistic models, in the context of automatic speech recognition. We propose several combination approaches, depending on the nature of th...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009